Abstract


In this paper, we address the problem of multi-label classification. We consider
linear classifiers and propose to learn a prior over the space of labels to directly
improve the performance of such methods. This prior takes the form of a quadratic
function of the labels and can encode both attractive and repulsive relations
between labels. We cast this problem as a structured prediction one, aiming at
optimizing either the accuracies of the predictors or the $F_1$-score. This leads to
an optimization problem closely related to the max-cut problem, which naturally
leads to semidefinite and spectral relaxations. We show on standard datasets how
such a general prior can improve the performance of multi-label techniques.


1 Introduction 

Multi-label classification aims at predicting a set of labels for each data instance [26, 28]. This
setting is ubiquitous in real-world applications and can, for example, take the form of video or text
tagging, where the goal is to assign instances to categories [14]. For video, [27] proposes to consider
the problem of labeling scenes in which several objects appear.

One of the main difficulties of this problem lies in the fact that the space of potential labelings $\mathcal{Y}$ is
exponentially bigger than the set of labels $\mathcal{V}$. Doing an exhaustive search over the space of labelings
is thus not possible. Moreover, contrary to the standard binary classification setting, the set $\mathcal{V}$ has a
specific structure and one has to take it into account, especially when the number of labels is large.
Indeed, imagine that we are given one classifier $f_v$ for each $v \in \mathcal{V}$; we would probably observe
that some $f_v$ predict labels that are not actually present. For instance, in image tagging, while it is very
likely to see a zebra and a lion in the same image, it is rather unlikely to see a reindeer with
a lion. A prior over labels could, for instance, penalize the prediction of a reindeer together
with a lion. Incorporating structure into the label set can be done a priori by assuming labels
are organized in a certain hierarchy [21]; [13] incorporates prior knowledge when training the
classifiers, which makes it possible to learn correlated classifiers. However, this prior does not affect
the way predictions are made.

Our goal is to learn such a prior over labels directly from data, at the same time as the classifiers are
learnt. This idea has already been tackled by [20], who restricted their study to the specific case of
incorporating positive affinities between labels. We go beyond this approach and propose a model
that takes into account both affinities and incompatibilities between labels.

Related work. 

A large part of the recent literature considers a moderately large set of labels $\mathcal{V}$ (order of hundreds)
and a huge space of labelings $\mathcal{Y}$. In this setting it is possible to learn specific classifiers for each label
separately. One way to train such classifiers is the well-known one-versus-rest technique (a.k.a. binary
relevance technique [26]).

Within this setting, some approaches use the structured prediction framework [25, 23] as we do.
This corresponds to considering the task of prediction as being a task over the huge output space $\mathcal{Y}$.
[18] has proposed to plug a model within a structured SVM, and considers the prior knowledge
between labels as fixed a priori, whereas we aim at learning it. They defined a proper loss and
the corresponding loss-augmented decoding. The loss they used is called the “max loss” and is
loosely related to the Hamming loss. This approach leads to an efficient loss-augmented decoding,
and avoids an exhaustive search over the power set $\mathcal{Y}$. Other approaches [5] considered the direct
optimization of the $F_1$-score within a structured SVM. Another part of the recent literature dealing
with multi-label classification [2] considers the case where the space of labels $\mathcal{V}$ itself is huge. In
these papers, the goal is to use the fact that only a few labels are present in an instance. This makes it
possible to reduce the dimension of the prediction space and to perform the labeling over a
lower-dimensional space. The priors we propose here could be combined with these approaches.

Contributions. Our contribution is four-fold: (1) we propose a model with priors for multi-label
classification allowing attractive as well as repulsive weights, (2) we cast the learning of this model
into the framework of structured prediction using either Hamming or $F_1$ losses and propose an
approach for solving exactly the loss-augmented decoding using the $F_1$ loss, (3) we propose
semidefinite and spectral relaxations to efficiently solve the resulting structured prediction problem, (4) we
show on real datasets how the learning of such a general prior can improve the multi-label prediction
over the models where no prior is learnt or where only attractive weights are allowed.

2 Structured Prediction for Multi-Label Classification 

In this section, we review several ways to perform the multi-label classification task when a prior
over the labels is fixed. Decoding consists in assigning potentially several labels to a data point
belonging to some feature space. We then discuss how to learn the parameters of the predictive
function. For the rest of the paper we denote our feature space by $\mathcal{X} \subset \mathbb{R}^d$.

2.1 The multi-label classification problem

Let us consider the set of possible labels $\mathcal{V}$ of cardinality $V$. We define the set of labelings as the set of
binary vectors $\mathcal{Y} = \{-1, 1\}^V$. The set $\mathcal{Y}$ is the one on which we perform our structured prediction.

Let us assume that for each possible label $v$, we are given a linear classifier parameterized by $w_v \in \mathbb{R}^d$.
We denote by $W \in \mathbb{R}^{d \times V}$ the concatenation of all the vectors $w_v$. In the multi-label
setting, the decoding problem is:

$$\hat{y}(x; W) \in \arg\max_{y \in \mathcal{Y}} D(x, y, W) := y^\top W^\top x. \qquad (1)$$

This is usually referred to as the binary relevance method for multi-label learning [26].

The aforementioned approach does not take into account any dependency between the different
labels. A way to do so is to penalize the discriminative function by some penalty $F$ depending on
the subset of predicted labels. In our case, we propose to consider:

$$\hat{y} \in \arg\max_{y \in \{-1,1\}^V} D(x, y; W, F) := y^\top W^\top x - F(y). \qquad (2)$$

However, not all functions $F$ are admissible: since $|\mathcal{Y}| = 2^V$, $F$ must be chosen so that (2) remains tractable.

A class of penalizations that are well-suited for our problem is the class of submodular functions
[1, 20]. When $F$ is submodular, the decoding becomes the maximization of a supermodular function
(maximization of a modular minus a submodular function). This is known to be tractable (solvable
in polynomial time in $V$). [20] has proposed to use a graph-cut based penalty. This corresponds
to $F(y) = y^\top A y - y^\top b$, where $b \in \mathbb{R}^V$ and $A \in \mathbb{R}^{V \times V}$ is proportional to the Laplacian matrix
of a graph. Intuitively, this corresponds to considering that labels are organized in a graph $G$ with
non-negative weights, encoding attractive affinities between the labels; the linear part of the prior $b$
corresponds to a prior over the frequencies of the classes.

For general weights, meaning that the matrix $A$ not only encodes affinities but also costs, the
decoding task becomes as hard as solving a max-cut problem. In Sec. 5.1 we review common convex
relaxations that yield a good approximate solution in polynomial time. Using a matrix $A$
with arbitrary entries, our decoding model becomes:

$$\hat{y}(x; W, A, b) \in \arg\max_{y \in \{-1,1\}^V} D(x, y, W, A, b) = \arg\max_{y \in \{-1,1\}^V} y^\top W^\top x + y^\top b - y^\top A y. \qquad (3)$$
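
To make the role of the quadratic prior in Eq. (3) concrete, here is a minimal sketch that decodes by exhaustive enumeration, which is only viable for very small $V$. The function names and the toy numbers (three labels standing for lion, zebra and reindeer) are illustrative assumptions, not part of the method described above; `s` stands for the vector of classifier scores $W^\top x$.

```python
from itertools import product

def score(y, s, b, A):
    """D = y^T s + y^T b - y^T A y, with s = W^T x the per-label classifier scores."""
    V = len(y)
    linear = sum(y[v] * (s[v] + b[v]) for v in range(V))
    quad = sum(y[u] * A[u][v] * y[v] for u in range(V) for v in range(V))
    return linear - quad

def decode(s, b, A):
    """Exhaustive maximization of Eq. (3) over {-1, 1}^V (2^V candidates)."""
    V = len(s)
    return max(product((-1, 1), repeat=V), key=lambda y: score(y, s, b, A))

# Hypothetical scores for the labels (lion, zebra, reindeer).
s = [1.0, 0.2, 0.1]
b = [0.0, 0.0, 0.0]
no_prior = [[0.0] * 3 for _ in range(3)]
# A positive off-diagonal entry penalizes two labels taking the same sign (repulsion).
repulsive = [[0.0, 0.0, 0.3],
             [0.0, 0.0, 0.0],
             [0.3, 0.0, 0.0]]

print(decode(s, b, no_prior))   # binary relevance: the sign of each score
print(decode(s, b, repulsive))  # the lion/reindeer repulsion flips the reindeer label
```

With $A = 0$ the decoder reduces to the binary relevance rule of Eq. (1); the repulsive entry is enough to overturn a weakly positive score.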


2.2 Learning the parameters W, b and A 

In the previous section we have assumed that we are given $V$ linear classifiers $w_v \in \mathbb{R}^d$, a linear
prior $b \in \mathbb{R}^V$ and a matrix $A \in \mathbb{R}^{V \times V}$. Thus, the discussed decoding problem can be seen as being
parameterized by $W$, $b$ and $A$.

Suppose that we are given $N$ examples $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, N$, and consider a loss function
between two labelings $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$. Ideally, given this loss, we would like to minimize the
following regularized empirical loss:

$$\min_{W, A, b} \frac{1}{N} \sum_{i=1}^{N} \ell(\hat{y}(x_i; W, A, b), y_i) + \lambda \Omega(W, A, b), \qquad (4)$$

where $\Omega$ is a convex regularizer (typically a squared $\ell_2$-norm) over the parameter space. This is
a hard combinatorial problem that thus needs to be relaxed. Following [25, 23], we define the
structural hinge loss $H$ as:

$$H(x_i, y_i; W, A, b) = \max_{y \in \{-1,1\}^V} \left\{ \ell(y, y_i) + D(x_i, y; W, A, b) - D(x_i, y_i; W, A, b) \right\}. \qquad (5)$$

We estimate parameters $W^*$, $b^*$ and $A^*$ by solving the following problem:

$$\min_{W, A, b} \frac{1}{N} \sum_{i=1}^{N} H(x_i, y_i; W, A, b) + \lambda \Omega(W, A, b). \qquad (6)$$
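
As an illustration of Eq. (5), the sketch below evaluates the structural hinge loss by enumerating all labelings, using the normalized Hamming loss as $\ell$. The helper names are hypothetical and `s` again denotes $W^\top x$; enumeration is only meant for tiny $V$ and stands in for the relaxations discussed later.

```python
from itertools import product

def D(y, s, b, A):
    """Discriminative function of Eq. (3): y^T s + y^T b - y^T A y, with s = W^T x."""
    V = len(y)
    return (sum(y[v] * (s[v] + b[v]) for v in range(V))
            - sum(y[u] * A[u][v] * y[v] for u in range(V) for v in range(V)))

def hamming(y, y_true):
    """Normalized Hamming loss (V - y^T y_true) / (2V)."""
    V = len(y)
    return (V - sum(a * c for a, c in zip(y, y_true))) / (2 * V)

def structural_hinge(s, y_true, b, A, loss=hamming):
    """Eq. (5) by enumeration: max_y {loss(y, y_true) + D(y)} - D(y_true)."""
    V = len(y_true)
    best = max(loss(y, y_true) + D(y, s, b, A)
               for y in product((-1, 1), repeat=V))
    return best - D(y_true, s, b, A)

zeros = [0.0, 0.0]
no_prior = [[0.0, 0.0], [0.0, 0.0]]
# Confident, correct scores give a zero hinge; weak margins are penalized.
print(structural_hinge([2.0, -2.0], (1, -1), zeros, no_prior))
print(structural_hinge([0.2, -0.2], (1, -1), zeros, no_prior))
```

Because $y = y_i$ is always a candidate in the maximization, the hinge is non-negative, and it upper-bounds the loss of the decoded labeling.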


3 Performance Measures and Losses for Multi-Label Tasks 


In order to set up the aforementioned problem, we need to define a proper loss function $\ell$.


Normalized Hamming loss. The simplest loss is based on accuracy, which is defined as:

$$a(y, y_i) = \frac{V + y^\top y_i}{2V} \in [0, 1]. \qquad (7)$$

The loss associated to accuracy is the so-called Hamming loss [12, 28]. It is defined as an affine
function of the binary label vector $y$ by:

$$\ell(y, y_i) = \frac{1}{4V} (y - y_i)^\top (y - y_i) \qquad (8)$$

$$\phantom{\ell(y, y_i)} = \frac{1}{2V} (V - y_i^\top y) = 1 - a(y, y_i) \in [0, 1]. \qquad (9)$$

Here and below, $\mathbf{1}$ denotes the $V$-dimensional vector of ones. This loss corresponds to the symmetric difference
between two sets, $A \Delta B = (A \cup B) \setminus (A \cap B)$. Note also that, if we consider that not all the errors
are equivalent, one can use a weighted Hamming loss instead.


$F_1$ loss. A common choice in the multi-label learning literature is the $F_\beta$-score loss [26, 20].
This loss is a function of precision and recall and has some important advantages over the Hamming
loss. In the common situation where each instance has only a few labels among all the possible ones,
the $F_\beta$ loss heavily penalizes the all-negative solution $(-1, \dots, -1)^\top$ while the Hamming loss does not.

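Precision and recall with respect to a training labeling $y_i \in \mathcal{Y}$ are defined respectively as:

$$p(y, y_i) = \frac{(\mathbf{1} + y_i)^\top (\mathbf{1} + y)}{(\mathbf{1} + y)^\top (\mathbf{1} + y)}, \qquad r(y, y_i) = \frac{(\mathbf{1} + y_i)^\top (\mathbf{1} + y)}{(\mathbf{1} + y_i)^\top (\mathbf{1} + y_i)}.$$

Then the general $F_\beta$ score is defined as, for every $\beta > 0$,

$$F_\beta(y, y_i) = \frac{(1 + \beta^2)\, p(y, y_i)\, r(y, y_i)}{\beta^2 p(y, y_i) + r(y, y_i)}. \qquad (10)$$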


The most widely used is the $F_1$ score (which turns out to be the harmonic mean of precision and
recall), and the associated loss is then $\ell(y, y_i) = 1 - F_1(y, y_i)$. More precisely:

$$\ell(y, y_i) = \frac{V - y^\top y_i}{2V + y^\top \mathbf{1} + y_i^\top \mathbf{1}} \in [0, 1]. \qquad (11)$$

Please note the non-linear dependency of this loss on $y$.
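
The contrast between the two losses on sparse labelings can be checked numerically. The sketch below implements Eqs. (9) and (11) directly on $\pm 1$ vectors; the function names and the convention of returning 0 when both labelings are all-negative (where Eq. (11) is 0/0) are assumptions made for this example.

```python
def hamming_loss(y, y_true):
    """Eq. (9): (V - y^T y_true) / (2V), for vectors with entries in {-1, +1}."""
    V = len(y)
    dot = sum(a * c for a, c in zip(y, y_true))
    return (V - dot) / (2 * V)

def f1_loss(y, y_true):
    """Eq. (11): (V - y^T y_true) / (2V + y^T 1 + y_true^T 1)."""
    V = len(y)
    dot = sum(a * c for a, c in zip(y, y_true))
    denom = 2 * V + sum(y) + sum(y_true)
    if denom == 0:
        # Both labelings are all-negative; assumed convention: no error.
        return 0.0
    return (V - dot) / denom

# One positive label out of ten, predicted as all-negative.
y_true = (1,) + (-1,) * 9
all_negative = (-1,) * 10
print(hamming_loss(all_negative, y_true))  # small: only one label is wrong
print(f1_loss(all_negative, y_true))       # maximal: no true positive was found
```

On this example the Hamming loss is 0.1 whereas the $F_1$ loss is 1, which is the behaviour described in the text.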


4 Loss-Augmented Decoding 


We propose to derive a structured-SVM-like optimization objective [25]. As mentioned earlier, we 
want to learn the parameters of our predictive function using annotated data. Following the definition 
of H, we can write the complete optimization problem (6) as: 


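$$\min_{W, A, b} \frac{1}{N} \sum_{i=1}^{N} \left[ \max_{y \in \mathcal{Y}} \left\{ \ell(y_i, y) + y^\top (W^\top x_i + b) - y^\top A y \right\} - y_i^\top (W^\top x_i + b) + y_i^\top A y_i \right] + \frac{\lambda_W}{2} \|W\|_2^2 + \frac{\lambda_A}{2} \|A\|_2^2. \qquad (12)$$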


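Using the Hamming loss. If we use the Hamming loss for $\ell$, then $\ell(y_i, y) = \frac{1}{2V}(V - y^\top y_i)$. Our
optimization problem can be re-written as follows:

$$\min_{W, A, b} \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{1}{2} + \max_{y \in \mathcal{Y}} \left\{ y^\top \left[ W^\top x_i + b - \frac{1}{2V} y_i \right] - y^\top A y \right\} - y_i^\top W^\top x_i - y_i^\top b + y_i^\top A y_i \right] + \frac{\lambda_W}{2} \|W\|_2^2 + \frac{\lambda_A}{2} \|A\|_2^2. \qquad (13)$$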


Note that the objective function of the optimization is jointly convex but not smooth. 

Using the $F_1$ loss. If in turn we decide to use the $F_1$ loss, the proposed problem is harder because of
the vector $y$ in the denominator. To cope with this issue, we can split the set $\mathcal{Y}$ into $(V + 1)$ subsets.
We define the set $\mathcal{Y}_k$ as the set of labelings such that $k$ entries are positive:

$$\forall k \in \{0, \dots, V\}, \quad \mathcal{Y}_k = \{ y \in \{-1, 1\}^V, \; y^\top \mathbf{1} = 2k - V \}.$$

As is often done when optimizing the $F_1$ score, which is a contingency-table based loss [15], we can
divide the initial problem into $V + 1$ subproblems by replacing $y^\top \mathbf{1}$ by $2k - V$ as follows:


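$$\max_{k \in \{0, \dots, V\}} \left\{ \frac{V}{V + y_i^\top \mathbf{1} + 2k} + \max_{y \in \mathcal{Y}_k} \; y^\top \left( -\frac{y_i}{V + y_i^\top \mathbf{1} + 2k} + W^\top x_i + b \right) - y^\top A y \right\}. \qquad (14)$$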
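
The decomposition of Eq. (14) can be sanity-checked on a toy problem. In the sketch below, both the direct maximization of the $F_1$ loss plus $D$ and the version split over the number $k$ of positive entries are computed by enumeration; the enumeration of $\mathcal{Y}_k$ is only a stand-in for the quadratic solvers of the following section. Function names and the 0/0 convention for all-negative labelings are assumptions of this example, and `s` denotes $W^\top x_i$.

```python
from itertools import product

def f1_loss(y, y_true):
    """Eq. (11), with the assumed convention that 0/0 counts as zero loss."""
    V = len(y)
    denom = 2 * V + sum(y) + sum(y_true)
    if denom == 0:
        return 0.0
    return (V - sum(a * c for a, c in zip(y, y_true))) / denom

def quad_score(y, lin, A):
    """y^T lin - y^T A y."""
    V = len(y)
    return (sum(y[v] * lin[v] for v in range(V))
            - sum(y[u] * A[u][v] * y[v] for u in range(V) for v in range(V)))

def augmented_direct(s, b, A, y_true):
    """max over all y of F1 loss + D, by plain enumeration."""
    V = len(y_true)
    lin = [s[v] + b[v] for v in range(V)]
    return max(f1_loss(y, y_true) + quad_score(y, lin, A)
               for y in product((-1, 1), repeat=V))

def augmented_split(s, b, A, y_true):
    """Eq. (14): for fixed k the loss is affine in y, so each subproblem is quadratic."""
    V = len(y_true)
    best = float("-inf")
    for k in range(V + 1):
        denom = V + sum(y_true) + 2 * k
        const, scale = (0.0, 0.0) if denom == 0 else (V / denom, 1.0 / denom)
        lin = [s[v] + b[v] - scale * y_true[v] for v in range(V)]
        for y in product((-1, 1), repeat=V):
            if sum(y) == 2 * k - V:  # y belongs to Y_k
                best = max(best, const + quad_score(y, lin, A))
    return best

s = [0.5, -0.3, 0.1]
b = [0.0, 0.1, -0.2]
A = [[0.0, 0.2, -0.1], [0.2, 0.0, 0.3], [-0.1, 0.3, 0.0]]
print(augmented_direct(s, b, A, (1, -1, -1)), augmented_split(s, b, A, (1, -1, -1)))
```

Both values coincide, since on $\mathcal{Y}_k$ the denominator of Eq. (11) is the constant $V + y_i^\top \mathbf{1} + 2k$.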


The problems of Eq. (13)-(14) above assume that we are able to solve quadratic optimization
problems for $y \in \mathcal{Y}$. [20] proposes a greedy approximate algorithm for solving this type of problem in
the specific case where the off-diagonal entries of the prior $A$ are negative.

In the following section, we propose relaxations of these problems leading to a tractable loss- 
augmented decoding with no restriction over the matrix A. 
5 Optimization in $\mathcal{Y}$


So far, we have written three problems that we are not able to solve efficiently. The first one was
the general decoding of Eq. (3). The other ones were the subproblems of Eq. (13) and Eq. (14). All
of these are quadratic boolean optimization problems and are closely linked to the max-cut problem
(see, e.g., [3, Sec. 5.1.5]). These can be written in the canonical form as follows:

$$\max_{u \in \{-1,1\}^V,\; L(u) = 0} u^\top b - u^\top A u, \qquad (15)$$

where $A \in \mathbb{R}^{V \times V}$, $b \in \mathbb{R}^V$ and $L$ is an affine function.

Note the presence of the additional constraint $L(u) = 0$. This additional equation is only needed for
the problem mentioned in Eq. (14). For the two other problems, one can simply ignore it. Eq. (15)
allows us to tackle three problems in a unified framework. In the next section we discuss two
relaxations to this problem. First we describe the standard SDP relaxation. We then present how to
cast this optimization problem as a spectral problem.


5.1 Classical semidefinite relaxation for max-cut


The family of problems presented in Eq. (15) is known as two-way partitioning problems. They
are a generalization of max-cut, with potentially negative entries in $A$. Also, they contain an extra
linear term (see Sec. 5.1.5 of [3]) and potential constraints over the domain.

There exists a classical semidefinite relaxation. Following [3, 6], we use a relaxation similar to
the one used by [11] to approximate the max-cut problem. We introduce a new variable $U =
uu^\top \in \mathbb{R}^{V \times V}$. Using this notation we can re-write the term $u^\top A u$ as $\mathrm{Tr}(AU)$. Then, using a set of
constraints that is equivalent to $U = uu^\top$, the problem (15) can be re-written as:

$$\max_{u \in \{-1,1\}^V} u^\top b - \mathrm{Tr}(AU) \quad \text{such that} \quad \begin{cases} \mathrm{Diag}(U) = \mathbf{1}, \\ \mathrm{Rank}(U) = 1, \\ U \succeq uu^\top, \\ L(u) = 0. \end{cases} \qquad (16)$$


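Following [3], the convex relaxation of this problem is obtained by removing the rank constraint.
We define $L$ as the affine function $L(u) = u^\top a - \beta$, where $a \in \mathbb{R}^V$ and $\beta \in \mathbb{R}$. We use the Schur
complement trick (see, e.g., [3]) and define the matrix $M$ as:

$$M = \begin{pmatrix} U & u \\ u^\top & 1 \end{pmatrix}.$$

Using $e_{V+1}$, the vector of $\mathbb{R}^{V+1}$ with all coordinates equal to zero except the last one, our relaxation of (15) can
be re-written as:

$$\max_{M} \; \mathrm{Tr}\left( M \begin{pmatrix} -A & b/2 \\ b^\top/2 & 0 \end{pmatrix} \right) \quad \text{such that} \quad \begin{cases} \mathrm{Diag}(M) = \mathbf{1}, \\ M \succeq 0, \\ (a^\top \;\; 0)\, M e_{V+1} = \beta. \end{cases} \qquad (17)$$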


Problem (17) can be solved using any standard convex optimization solver at least for small V 
(< 100). When V is large, one can use specific techniques relying explicitly on the fact the solution 
is expected to be low-rank (see, e.g., [16] and references therein). 

Rounding scheme 

At test time, we follow [3] to round the relaxed solution, i.e., get back to some admissible solution
of (15). We notice that at the optimum $(u, U)$ of Eq. (17), $U \succeq uu^\top$ implies that $U - uu^\top$ is
a covariance matrix. Therefore, we simply sample several $v \sim \mathcal{N}(u, U - uu^\top)$ from a normal
distribution, round each sample by taking the signs and choose the best one in terms of the objective
function. This procedure leads to good feasible points in our experiments.
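
The rounding step can be sketched as follows with NumPy, for the case without the constraint $L(u) = 0$. The function names, the number of samples and the fixed seed are assumptions of this example; the pair `(u, U)` is taken as given, since the SDP solver itself is not shown.

```python
import numpy as np

def objective(u, A, b):
    """Objective of Eq. (15): u^T b - u^T A u."""
    return u @ b - u @ A @ u

def gaussian_rounding(u, U, A, b, n_samples=50, seed=0):
    """Sample v ~ N(u, U - u u^T), take signs, keep the best candidate for Eq. (15)."""
    rng = np.random.default_rng(seed)
    cov = U - np.outer(u, u)
    cov = (cov + cov.T) / 2  # symmetrize against round-off in the solver output
    samples = rng.multivariate_normal(u, cov, size=n_samples, check_valid="ignore")
    candidates = np.where(samples >= 0, 1.0, -1.0)  # sign rounding, ties sent to +1
    values = [objective(c, A, b) for c in candidates]
    return candidates[int(np.argmax(values))]

A = np.array([[0.0, 0.5, -0.2],
              [0.5, 0.0, 0.1],
              [-0.2, 0.1, 0.0]])
b = np.array([0.3, -0.1, 0.2])

# If the relaxation is tight (U = u u^T with u in {-1, 1}^V), rounding returns u itself.
u0 = np.array([1.0, -1.0, 1.0])
print(gaussian_rounding(u0, np.outer(u0, u0), A, b))
```

When the relaxed solution is already rank one, the covariance vanishes and every sample equals $u$; otherwise the samples explore sign patterns around $u$ in proportion to the slack $U - uu^\top$.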
5.2 Spectral relaxation


The generic problem in Eq. (15) can be relaxed by replacing the integrality constraint $u \in
\{-1, 1\}^V$ with a quadratic equality $u^\top u = V$. Please note this makes the problem non-convex.
Using the same expression for $L(u)$ as in the previous section leads to the following optimization
problem:

$$\max_{u \in \mathbb{R}^V} u^\top b - u^\top A u \quad \text{such that} \quad \begin{cases} u^\top u = V, \\ u^\top a = \beta. \end{cases} \qquad (18)$$

We deal with the linear constraint by dualizing it, yielding the following problem:

$$\min_{\mu \in \mathbb{R}} \; \mu\beta + \max_{u \in \mathbb{R}^V,\; u^\top u = V} u^\top (b - \mu a) - u^\top A u. \qquad (19)$$

This can be solved by performing a binary search over $\mu$.


The inner loop problem is classical in optimization, in particular in trust-region methods [9, 22].
Using the Lagrange multiplier technique, it reduces to solving a quadratic eigenvalue problem [24].
Solving the inner loop problem of Eq. (19) (with nonzero $b$) is equivalent to finding the minimal
eigenvalue of the quadratic eigenvalue problem:

$$\left( \lambda^2 I - 2\lambda A + A^2 - \frac{1}{4V} (b - \mu a)(b - \mu a)^\top \right) u = 0, \qquad (20)$$

where $I$ denotes the $V \times V$ identity matrix. The problem above is solved efficiently by computing
the eigenvalues of the matrix $S$:

$$S = \begin{pmatrix} A & -I \\ -\frac{1}{4V} (b - \mu a)(b - \mu a)^\top & A \end{pmatrix}. \qquad (21)$$

Once this has been solved, we get the desired solution by taking $u = \frac{1}{2}(A - \lambda I)^{-1}(b - \mu a)$, where
$\lambda$ is the smallest non-zero eigenvalue of $S$.

Note that when optimizing the Hamming loss, we get rid of the constraint $L(u) = 0$. In that case
we can set $\mu = 0$ and solve the inner loop problem only once.
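
The inner problem (maximize $u^\top c - u^\top A u$ subject to $u^\top u = V$, with $c = b - \mu a$) can be sketched with NumPy as below. This is a minimal illustration under stated assumptions: $A$ is symmetric, the generic case holds (the vector $c$ is not orthogonal to the eigenvectors of the smallest eigenvalue of $A$), and the multiplier is taken as the smallest real eigenvalue of the linearization $S$ of the quadratic eigenvalue problem, with the scaling $1/(4V)$ obtained from the stationarity condition $(A - \lambda I)u = c/2$. The function name is hypothetical.

```python
import numpy as np

def sphere_quadratic_max(A, c, V):
    """Maximize u^T c - u^T A u subject to u^T u = V (A symmetric, generic case)."""
    n = A.shape[0]
    h = c / 2.0
    # Linearization of (lam^2 I - 2 lam A + A^2 - h h^T / V) u = 0.
    S = np.block([[A, -np.eye(n)],
                  [-np.outer(h, h) / V, A]])
    eigvals = np.linalg.eigvals(S)
    real = eigvals[np.abs(eigvals.imag) < 1e-8].real
    lam = real.min()  # leftmost real eigenvalue, below the spectrum of A
    u = np.linalg.solve(A - lam * np.eye(n), h)
    return u, lam

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = (M + M.T) / 2
c = rng.standard_normal(n)
u, lam = sphere_quadratic_max(A, c, n)
print(u @ u, lam)  # the squared norm should match V = n up to round-off
```

For the Hamming loss one would call this once with $c = b$; for the $F_1$ loss it would be called inside the search over $\mu$ with $c = b - \mu a$. Rounding the resulting real vector to $\{-1, 1\}^V$ is a separate step.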


5.3 Cheaper (but still efficient) solution for the spectral relaxation 

In this section we present another way to deal with the spectral relaxation, inspired by [10]. The
proposed method is computationally more efficient than the one of the previous section since it
does not involve the binary search over the Lagrange multiplier $\mu$.

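We start from the problem of Eq. (18). By the change of variables $v = \begin{pmatrix} u \\ 1 \end{pmatrix}$ and $B = \begin{pmatrix} -A & b/2 \\ b^\top/2 & 0 \end{pmatrix}$,
and by introducing $D = \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix}$ ($I$ being the $V$-dimensional identity matrix), we can write the problem
as:

$$\max_{v \in \mathbb{R}^{V+1}} v^\top B v \quad \text{such that} \quad \begin{cases} v^\top D v = V, \\ \begin{pmatrix} a & 0 \\ 0 & 1 \end{pmatrix}^\top v = \begin{pmatrix} \beta \\ 1 \end{pmatrix}. \end{cases} \qquad (22)$$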


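Following [10], let us simply introduce the QR factorization of the matrix $\begin{pmatrix} a & 0 \\ 0 & 1 \end{pmatrix} = QR$, where
$Q \in \mathbb{R}^{(V+1) \times (V+1)}$ is an orthogonal matrix and $R \in \mathbb{R}^{(V+1) \times 2}$. We also introduce $U =
Q^\top v = \begin{pmatrix} U_1 \\ U_2 \end{pmatrix}$, with $U_1 \in \mathbb{R}^2$ and $U_2 \in \mathbb{R}^{V-1}$.
Eq. (22) can be rewritten as:

$$\max_{U \in \mathbb{R}^{V+1}} U^\top Q^\top B Q U \quad \text{such that} \quad \begin{cases} U^\top D U = V, \\ R^\top U = \begin{pmatrix} \beta \\ 1 \end{pmatrix}. \end{cases} \qquad (23)$$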
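Note that the last constraint fixes the variable $U_1 = R^{-\top} \begin{pmatrix} \beta \\ 1 \end{pmatrix}$. With a slight abuse of
notation, $R^{-\top}$ denotes the inverse transpose of the square, nonzero block of $R$. Let us define
$Q^\top B Q = \begin{pmatrix} \tilde{A} & \Gamma/2 \\ \Gamma^\top/2 & C \end{pmatrix}$
with $\tilde{A} \in \mathbb{R}^{2 \times 2}$, $\Gamma \in \mathbb{R}^{2 \times (V-1)}$ and $C \in \mathbb{R}^{(V-1) \times (V-1)}$. Using the previous notations, we get:

$$U^\top Q^\top B Q U = U_2^\top C U_2 + U_1^\top \Gamma U_2 + U_1^\top \tilde{A} U_1.$$

Let us also introduce $\delta = V - U_1^\top U_1$. Since $U_1$ is now entirely determined, this problem is equivalent
to:

$$\max_{U_2 \in \mathbb{R}^{V-1}} U_2^\top C U_2 + U_1^\top \Gamma U_2 \quad \text{such that} \quad U_2^\top D U_2 = \delta. \qquad (24)$$

Note that, with a slight abuse of notation, $D$ is here restricted to its last components.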

5.4 Links with graph-cuts 

The min-cut problem can be written as an optimization problem through the following equation:

$$\min_{z \in \{0,1\}^V} \sum_{j=1}^{V} \sum_{i=1}^{j-1} C_{i,j} |z_i - z_j| + c^\top z, \qquad (25)$$

where $C \in \mathbb{R}^{V \times V}$ is symmetric and $c \in \mathbb{R}^V$. By making the change of variables $z = \frac{\mathbf{1} + y}{2}$ and carrying out some
calculations, we get the following equivalent program (up to an additive constant and a positive scaling):

$$\min_{y \in \{-1,1\}^V} 2 y^\top c - y^\top C y. \qquad (26)$$

Therefore, when the matrix $A$ has negative off-diagonal entries, the problem formulated in Eq. (15)
can be solved using min-cut / max-flow. We can use standard min-cut / max-flow toolboxes by
providing the matrix $C = -A$ and $c = -\frac{1}{2} b$.

When optimizing the $F_1$ loss, note that we can use the same dualization for the constraint $L(u) = 0$.
We proceed exactly as with the spectral relaxation except that the inner loop is solved with min-cut
/ max-flow.
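
The change of variables between Eq. (25) and Eq. (26) can be verified numerically on a small instance. The sketch below assumes a symmetric matrix $C$ with zero diagonal and checks, for every labeling, that the cut objective equals a constant plus one quarter of the quadratic objective; the function names and the numbers are illustrative.

```python
from itertools import product

def cut_objective(z, C, c):
    """Eq. (25): sum over pairs i < j of C_ij |z_i - z_j|, plus c^T z, for z in {0,1}^V."""
    V = len(z)
    pairwise = sum(C[i][j] * abs(z[i] - z[j]) for j in range(V) for i in range(j))
    return pairwise + sum(c[i] * z[i] for i in range(V))

def quad_objective(y, C, c):
    """Eq. (26): 2 y^T c - y^T C y, for y in {-1,1}^V."""
    V = len(y)
    return (2 * sum(c[i] * y[i] for i in range(V))
            - sum(y[i] * C[i][j] * y[j] for i in range(V) for j in range(V)))

C = [[0.0, 1.0, 0.5],
     [1.0, 0.0, 2.0],
     [0.5, 2.0, 0.0]]
c = [0.5, -1.0, 0.25]
V = len(c)
# Constant offset: half the sum of pairwise weights plus half the sum of unary terms.
offset = sum(C[i][j] for j in range(V) for i in range(j)) / 2 + sum(c) / 2

for y in product((-1, 1), repeat=V):
    z = tuple((1 + yi) // 2 for yi in y)  # z = (1 + y) / 2
    assert abs(cut_objective(z, C, c) - (offset + quad_objective(y, C, c) / 4)) < 1e-12
print("Eq. (25) and Eq. (26) agree up to an offset and a factor 1/4")
```

Since the two objectives differ by an affine transformation with positive slope, they share the same minimizers, which is what allows a min-cut solver to be used when $C = -A$ has non-negative off-diagonal entries.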

5.5 Solving the $F_1$ loss-augmented decoding for negative $A$

In this section, we show how we can solve the constrained problem by relating it to the well-studied
total variation denoising problem [4, 1]. Note that, contrary to [20], in this section we deal with the
cardinality constraint exactly and we do not use any approximation algorithm in this specific case.
We just use a total variation minimization algorithm to perform the constrained minimization.

Here we consider that the constraint of Eq. (15) is simply a cardinality constraint, namely that it is
of the form $u^\top c = \alpha$ for a certain $\alpha \in \{1, \dots, V\}$, where $c$ is the $V$-dimensional vector composed of
ones.

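Now, we dualize this equality constraint by introducing the associated Lagrange multiplier $\mu$. This
yields the following problem:

$$\max_{\mu \in \mathbb{R}} \min_{u \in \{-1,1\}^V} u^\top A u - u^\top b + \mu(\alpha - u^\top c). \qquad (27)$$

Equivalently, by considering the variable $z = (u + c)/2 \in \{0,1\}^V$ and dropping the terms that depend
neither on $z$ nor on $\mu$, we get the following problem:

$$\max_{\mu \in \mathbb{R}} \; \mu(\alpha + V) + \min_{z \in \{0,1\}^V} 4 z^\top A z - z^\top (4Ac + 2b) - 2\mu z^\top c. \qquad (28)$$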

The problem of Eq. (28) is a separable submodular optimization problem [1]. Thus solving it can be
done by considering the associated proximal problem. More precisely, if we introduce the Choquet
integral of the cut $J(u)$ (often referred to as the “co-area formula” [4] for the specific case of cut
functions, or Lovász extension for submodular functions), the generic proximal problem associated
to any cut problem is:

$$\min_{u \in \mathbb{R}^V} \frac{1}{2} \|u - g\|_2^2 + J(u), \qquad (29)$$

where $g$ in our case is exactly $4Ac + 2b$.

This problem is the well-known total variation denoising problem. There exist several efficient
algorithms to deal with it, especially the ones relying on parametric max-flow techniques. Once
problem (29) has been solved and we have recovered its (unique if $\alpha$ is positive) solution $u^*$, we get
all the candidates for being a solution of (27) by considering the different level sets of $u^*$. Then, we just
have to compute the associated objective values and select the optimal one.

6 Optimization in W and A 

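We optimize our cost function in Eq. (12) with stochastic subgradient descent. When we relax the
inner optimization problem in $y \in \mathcal{Y}$, we implicitly modify the cost function. Therefore we have to
be careful when computing the subgradients.

In this section we provide the derivations in one specific case. The details for the other cases can
be found in the supplementary material. When using the Hamming loss and the SDP relaxation, our
cost function becomes, with $\mathcal{U} = \{(u, U) : u \in \mathbb{R}^V, \; U \in \mathbb{R}^{V \times V}, \; U \succeq uu^\top, \; \mathrm{Diag}(U) = \mathbf{1}_V\}$:

$$g(W, A, b) = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{1}{2} + \max_{(u, U) \in \mathcal{U}} \left\{ u^\top \left[ W^\top x_i + b - \frac{1}{2V} y_i \right] - \mathrm{Tr}(AU) \right\} - y_i^\top W^\top x_i - y_i^\top b + y_i^\top A y_i \right] + \frac{\lambda_W}{2} \|W\|_2^2 + \frac{\lambda_A}{2} \|A\|_2^2. \qquad (30)$$

To obtain the subgradients, we first solve the relaxed loss-augmented inference for each example.
Using the obtained $u$ and $U$, we compute the subgradients in $W$, $b$ and $A$ as follows:

$$\partial_W g(W, A, b) = \lambda_W W + \frac{1}{N} \sum_{i=1}^{N} x_i (u - y_i)^\top, \qquad (31)$$

$$\partial_b g(W, A, b) = \frac{1}{N} \sum_{i=1}^{N} (u - y_i), \qquad (32)$$

$$\partial_A g(W, A, b) = \lambda_A A + \frac{1}{N} \sum_{i=1}^{N} \left( -U + y_i y_i^\top \right). \qquad (33)$$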
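
For a single training example, the stochastic counterpart of Eqs. (31)-(33) can be sketched as below with NumPy. The function name is hypothetical, and the relaxed loss-augmented solution `(u, U)` is assumed to be provided by one of the solvers of Section 5.

```python
import numpy as np

def subgradient_single(x, y, u, U, W, A, lam_W, lam_A):
    """One-example version of Eqs. (31)-(33).

    x: features (d,), y: true labeling in {-1,1}^V, (u, U): relaxed
    loss-augmented solution, W: (d, V) classifiers, A: (V, V) prior.
    """
    grad_W = lam_W * W + np.outer(x, u - y)
    grad_b = u - y
    grad_A = lam_A * A + (-U + np.outer(y, y))
    return grad_W, grad_b, grad_A

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, -1.0])
W = np.ones((3, 2))
A = np.array([[0.0, 1.0], [1.0, 0.0]])

# If the loss-augmented solution coincides with the ground truth, only the
# regularization terms contribute to the subgradient.
gW, gb, gA = subgradient_single(x, y, y, np.outer(y, y), W, A, 0.1, 0.2)
print(gW, gb, gA, sep="\n")
```

A stochastic subgradient step would then subtract a step size times each of these quantities from $W$, $b$ and $A$; any structural constraint one wishes to impose on $A$ (for instance a sign constraint) would need an additional projection that is not shown here.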

7 Experimental Evaluation 

We now validate the proposed approach on standard benchmarks. We compare our implementation
to [20] and to the one-versus-rest model (OvR). The code corresponding to the described method
will be made publicly available. In this experimental section we first describe the datasets used and
discuss the baselines to which we compare.

Datasets. We validate our approach on four datasets. Following [20], we picked our datasets from
the mulan¹ repository. We picked the yeast [7], enron, medical [19] and bibtex [17] datasets. The
datasets are of various sizes and natures: yeast only has 14 labels while bibtex has 159. All of them
also present different challenges (different structures, label co-occurrence patterns, etc.).

These datasets are given with a train / test split. We further split the training set to generate a
validation set. We select all relevant parameters by plain validation on this set. We report all performances
on the actual test set as given in the dataset. Characteristics of these datasets are given in Table 1.

One-versus-rest results. In Table 2 we report the performance of a one-versus-rest model for all 
the datasets. For every label, we train a linear classifier using a standard SVM toolbox [8]. We 
select the hyper-parameters by validation on a held-out part of the training set. We compare three 
criteria for choosing the optimal set of regularization parameters. We can either select a common 

¹ http://mulan.sourceforge.net/datasets.html
           Instances   Features   Labels
yeast      2417        103        14
enron      1702        1001       53
medical    978         1449       45
bibtex     7395        1836       159

Table 1: Standard characteristics of the datasets used.


regularization parameter for all classes (“Single $\lambda$” columns), or one per class, chosen with the
Hamming loss, which decouples over classes (“Multiple $\lambda$” column). When choosing a common $\lambda$ for
all classes, one can choose it according to the $F_1$ or Hamming loss on the validation set.



           Single λ              Multiple λ
           F1        Hamming     Hamming
yeast      0.39      0.40        0.54
enron      0.48      0.49        0.46
medical    0.29      0.29        0.28
bibtex     0.61      0.66        0.66

Table 2: Linear SVM performance on the considered datasets. We report the average $F_1$ loss for
various schemes for choosing the regularization parameter $\lambda$.


Table 2 shows that it is sometimes important to use the relevant loss as a criterion to select the
hyper-parameter. In our experiments, this becomes more and more important as the size of the label set
increases, since, as discussed in Sec. 3, the Hamming loss then behaves more and more differently from
the $F_1$ loss.

One would also expect that picking one parameter per label would lead to better performance. But
the benefit of selecting a specific parameter per class is offset by the fact that one cannot use the
$F_1$ loss in this case. In all our remaining simulations, we use a single $\lambda$ for all classes.

Our model and comparison to [20]. We run our algorithm (with $\ell$ equal to the Hamming loss) on
all four datasets and compare to the available implementation of [20]. For all methods we select all
hyper-parameters based on the performance in terms of $F_1$ loss on the validation set. Because of
the challenging number of labels for bibtex, we were able to run neither the code from [20], nor the
SDP, in reasonable time.



                                SDP                       Spectral
          OvR    [20]   MC      A ≤ 0   A ≥ 0   Any A     A ≤ 0   A ≥ 0   Any A
yeast     0.39   0.36   0.40    0.40    0.39    0.39      0.39    0.37    0.37
enron     0.48   0.45   0.47    0.47    0.47    0.45      0.48    0.49    0.49
medical   0.29   0.33   0.29    0.31    0.29    0.24      0.30    0.21    0.24
bibtex    0.61   N/A    0.61    N/A     N/A     N/A       0.62    0.57    0.60

Table 3: Comparison between [20] and different variants of our method. OvR denotes the
one-versus-rest approach. MC is our algorithm with the inner loop being solved using min-cut /
max-flow. SDP is the semidefinite relaxation of the inner loop. Spectral is the spectral relaxation of the
inner loop.


Table 3 compares the one-versus-rest approach, the approach described in [20] and variants of our
method. We compare the two relaxations we proposed while optimizing the Hamming loss. Please
recall that the min-cut (MC) solution implies that $A \leq 0$ (non-positive entries).

When $A \leq 0$, we can measure the tightness of the proposed relaxations. We see that the various
relaxations, SDP then spectral, do not degrade performance compared with the exact approach MC (which
cannot be run for general $A$).
We also notice that using a negative matrix $A$ is a strong limitation. The performance observed
when $A$ is unconstrained or non-negative is better. This motivates our formulation and shows that
repulsive weights between labels are relevant.



          OvR    [20]   Our Hamming   Our F1
yeast     0.39   0.37   0.43          0.43
enron     0.48   0.45   0.47          0.47
medical   0.29   0.33   0.28          0.28
bibtex    0.61   0.58   0.60          0.60

Table 4: Comparison of $F_1$ losses when optimizing the $F_1$ loss versus the Hamming loss.


The Hamming loss and the $F_1$ loss. In this experiment we do not make use of the quadratic prior,
so $A = 0$. Table 4 gives the $F_1$ loss we obtain by optimizing either the $F_1$ loss or the Hamming loss.
We compare to the implementation of the $F_1$ score minimization in [20] (carried out using a greedy
technique). In that table, “Our F1” is our own implementation of the support vector technique for the
$F_1$ loss [15] using the optimization described in Section 4. This is an exact optimization technique. We
also report the results obtained by training SVMs using the one-versus-rest scheme. It appears that,
on these standard datasets ($V \approx 10$ to $50$), optimizing the $F_1$ loss does not yield better performance
than optimizing the Hamming loss.


8 Conclusion 

We have proposed a framework to learn a prior for improving the performance of multi-label
classification tasks. This prior takes the form of a quadratic function over the space of labels and
incorporates both affinities and negative affinities. Existing work [20] only takes into account positive
affinities between labels. We provide semidefinite and spectral relaxations of the learning problem,
leading to an efficient optimization scheme. In particular, the spectral relaxation makes it computationally
feasible to deal with rather large datasets ($V > 150$), whereas existing algorithms cannot (since the
loss-augmented decoding problems have to be solved many times).

It would be interesting to see how to extend the range of applicability of the semidefinite
relaxations, which is, for now, limited to multi-label problems for which $V$ is of the order of
hundreds. To that end, we could use techniques from matrix optimization theory that take into
account the fact that the solution we aim at finding has low rank [16].